GREATER AVAILABLE PROCESSING
POWER AND BETTER ALGORITHMS
CONTINUE TO MAKE SPEECH
RECOGNITION MORE VIABLE FOR
MANY REAL-WORLD APPLICATIONS.
THE REAL ADVANCES, HOWEVER,
HAVE BEEN IN HOW YOU
CAN USE SPEECH TO ACHIEVE
"NATURAL UNDERSTANDING."




SPEECH RECOGNITION:

It's not what you say;

it's how you say it

SPEECH APPLICATIONS THAT DIRECT YOU TO "Say 1 if you want to do X" miss the true power of speech. Even applications that walk through a script, sometimes known as "phone jail," can leave you wondering where the dialogue is going and whether you can get to your desired goal. Most people would rather get straight to the point and say, "Call home" or "Buy 100 shares of NCC at $4.50." Fortunately, speech-recognition technology can now let you use devices the way you already understand, by talking, instead of requiring you to use an unfamiliar interface. The key to getting the most from speech recognition lies in taking advantage of developments in interface design that speech recognition enables.

Today's DSPs can provide enough processing power to recognize speech from a reasonably sized vocabulary with sufficient accuracy. The two killer applications for speech recognition this year are hands-free adapters for car phones, which use low-end command-and-control technology, and automated directory assistance, which uses high-end natural-understanding technology. Soon, you'll be able to buy and sell stock over the phone without human intervention. Even toys and simple appliances can use speech recognition.

The most recent advances in speech recognition come through context and natural-language understanding. In "mixed-initiative" mode, either the speaker or the voice system can direct the conversation. With modeless or conversational systems, either side can say "anything" at any time. Users are not locked into a script; phrasing, word choice, and word order are all flexible. One important feature of a conversational system is the ability to "barge in," allowing the speaker to interrupt the voice system with either an answer to the question the system is currently asking or a command directing action in another context. For example, during an exchange to buy stock, the speaker could ask an out-of-context question, such as the price of a different stock.

AT A GLANCE

• You can find speech-recognition technology in almost every form factor, from dedicated ICs to portable C algorithms for DSPs to turnkey boards for telephony systems.

• Because recognition engines turn speech into text, speech can serve as a front end for any application with a text interface.

• Adding memory to the recognition engine can often yield faster recognition than adding more processing power.

• Dialogue tools build conversational interfaces that free users from constrained scripts and allow users to speak more naturally.

Where can you implement speech-recognition technology? Because recognition engines turn speech into text, speech recognition can serve as a front end to any system supporting a text interface. You can implement the speech-recognition engine in several ways. For simple to complex applications, you can buy speech-recognition-specific ICs, run a software engine on a DSP, or locate speech-recognition processing on a host server. Voice recognition has even penetrated the Web with VXML (voice-extensible markup language; www.vxmlforum.org).

OPTIMIZING RECOGNITION PERFORMANCE AND INCREASING ACCURACY

• For short vocabulary sets, use dissimilar-sounding words such as "hat," "kitten," and "mouse" instead of "hat," "cat," and "rat" to avoid word confusion. You can also vary the number of syllables. For example, use "orange," "watermelon," and "grape" instead of "orange," "apple," and "cherry."

• In general, sentences are easier to recognize than words, given that a sentence has more variation from other sentences than words do from other words. Longer responses, such as "Buy stocks" or "View my portfolio," are easier to recognize than shorter ones, such as "Buy" or "View."

• Dynamically reducing vocabulary based on the current context increases accuracy. For example, if the user must supply a dollar amount, the system need not search most of the vocabulary because the speaker will not use it.

• Record training samples in a typical target environment. This step includes using the same microphone; ambient noise; a closed environment, such as a car; distance of the microphone from the speaker; and other parameters. You should also duplicate the state of a typical user, both physically and emotionally. (For example, people using exercise equipment with speech recognition tend to breathe heavily, office users are quiet and calm, and children are loud and excited.)

• Build questions that suggest available responses. For example, "What do you want to do?" is too open a question. "Who do you want to call?" leads the user to a response that the application can understand.

• Note that, although "barge in" is a desirable feature, it requires a full-duplex interface and therefore processor-intensive echo cancellation.

• Structure your dialogue to clarify vocabulary and put it into smaller subsets. For example, in a telephone-directory application, instead of having the user ask for John Smith and then for John's department to locate the right John Smith, first determine the department. By doing so, the engine must recognize a department from a small subset of the vocabulary and then can reduce the search space for the name from the entire directory to only those people in the department.

• Discrete speech-recognition engines require the user to pause between each word or digit. Such pauses are unnatural and can even cause users to talk artificially or monotonically, reducing accuracy. Continuous engines require more processing power but present a better interface.

• If the speech-recognition engine does not understand an answer the first time, it will probably not understand it a second time. Restructure clarification questions, perhaps asking for a different format ("sixty-four" versus "six, four"). In any case, some users will innocently confound the system. You must be able to identify these users and handle their needs, either by switching to a less ambiguous interface, such as "Press 1 to do x," or by routing the users to a person.

• The ability to record a user's voice is important for building a conversational "confirmation structure." With a phone, for example, the name "Conrad" may mean a particular phone number. When the user trains the phone to understand "Conrad," the phone could record the user's voice saying "Conrad" instead of reading back Conrad's number. In this way, when the speech-recognition engine doesn't recognize the name "Conrad," the phone could ask, "Are you trying to call Conrad?"

COMMAND AND CONTROL

For low-end applications requiring limited vocabulary, you can buy speech-recognition-specific ICs such as Sensory's RSC-264T and RSC-364 speech-recognition µCs with ADCs and DACs. With as much as 64 kbytes of ROM and 2.5 kbytes of RAM, the chips can support a combination of technologies, including a speaker-independent engine, a speaker-dependent engine, speaker adaptation, speech verification, voice recording, and speech and music synthesis; each module requires 5 to 10 kbytes of memory. Speaker-dependent templates require 100 bytes per 3 sec of speech, speaker-independent templates require on the order of kbytes for 14 words, and synthesis requires 1 to 2 kbytes per sec of speech or music (see sidebar "Speaker-independent or -dependent templates"). The µC can also serve as a general-purpose µC to control lights or motors. Sensory offers a development kit, including a chip simulator, a compiler, a linker, code samples, development boards, and a tutorial for $3200. You can also purchase the chip as a module, and the various algorithms are available in C for use in DSPs.
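The memory figures quoted for the Sensory chips lend themselves to a quick back-of-the-envelope budget. The following is only a sketch: the function names are hypothetical, and lumping modules, templates, and synthesis data into one 64-kbyte budget is a simplifying assumption, because the article does not say which items live in ROM and which in RAM.

```python
KBYTE = 1024

# "As much as 64 kbytes of ROM" (figure from the article).
ROM_BUDGET = 64 * KBYTE


def memory_estimate(modules, template_seconds, synthesis_seconds):
    """Return (low, high) byte estimates for a feature mix.

    modules           -- number of technology modules (5 to 10 kbytes each)
    template_seconds  -- seconds of speaker-dependent template speech
                         (100 bytes per 3 sec)
    synthesis_seconds -- seconds of synthesized speech or music
                         (1 to 2 kbytes per sec)
    """
    template_bytes = template_seconds * 100 // 3
    low = modules * 5 * KBYTE + template_bytes + synthesis_seconds * 1 * KBYTE
    high = modules * 10 * KBYTE + template_bytes + synthesis_seconds * 2 * KBYTE
    return low, high


def fits(modules, template_seconds, synthesis_seconds, budget=ROM_BUDGET):
    """True if even the pessimistic estimate stays within the budget."""
    _, high = memory_estimate(modules, template_seconds, synthesis_seconds)
    return high <= budget
```

With three modules, 30 sec of speaker-dependent templates, and 10 sec of synthesis, the pessimistic estimate stays under 64 kbytes; doubling the module count does not.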

As inexpensive as Sensory's chips are, at $2.25 (100,000) for the RSC-264T, you'll still find it less expensive to run speech-recognition software on a DSP if you have one available. Also, to perform operations such as acoustic echo cancellation, you need a DSP. However, you may find the time-to-market benefits of adding a chip versus integrating software worth the extra cost.

Several DSP companies, such as Analog Devices and Philips, offer DSPs with ADCs and DACs supporting half- or full-duplex channels and various memories, including flash. Depending on the performance of the DSP and available memory, you can implement a variety of filters, speech-recognition engines, and vocabulary sizes. Because these DSP chips are programmable, you can update algorithms and vocabularies. Be sure to check with the chip companies' partners to see which algorithms they have already ported to the chips. Something else to watch out for: Some of these chips are really multichip modules, meaning that several pieces of silicon come in the same package. By putting these pieces (the DSP, converters, and memory) together yourself, you can choose the optimal balance and cost for your application.

On the board side, companies such as Radisys offer platforms for supporting voice recognition, from call centers to dictation centers that have 500 incoming lines. THE Germany offers embeddable modules for hands-free adapters for car phones. Many board companies supply no speech-recognition software but partner with companies that do. Some boards offer application-specific functions, such as the ability to decompress compressed voice (voice-over-Internet Protocol applications), handle protocols, and even route calls. In a server environment, you can run a powerful speech-recognition engine on the server, using a front-end engine to extract speech features to send to the server for recognition. In this way, the cost of speech-recognition processing falls on the server, where many nodes can take advantage of the benefits.

Given that speech recognition is technically feasible for a variety of applications, vocabulary size is the key factor to consider in defining the characteristics of a speech-recognition engine. If the engine has a vocabulary of only 10 words, the engine has a high probability of recognizing the right word. Such an engine probably doesn't have to work that hard; it's easy to distinguish between "yes" and "no." The larger the vocabulary, however, the more robust the engine you need to distinguish between similar-sounding words, such as "Austin" and "Boston," which decrease confidence scores (that is, how confident the engine is that it accurately recognizes the word) and increase recognition time (see sidebar "Optimizing recognition performance and increasing accuracy").

Figure 1

[RESPONSE]="BUY 100 SHARES OF NCC AT $4.50."
[RESPONSE]="GIVE ME NATIONAL COMPUTER CORP AT FOUR-FIFTY FOR 100 UNITS."

Parsing voice responses into semantic slots extracts meaning from complex utterances. Responses with equivalent meaning yield equivalent semantic breakdowns, regardless of word order or choice of words.



SPEAKER-INDEPENDENT OR -DEPENDENT TEMPLATES

An independent speech-recognition engine can, in theory, understand the speech of all but the most difficult speakers with accents, although you can adjust an engine for regional characteristics. Dependent engines, on the other hand, learn a person's voice and the nuance of the person's speech, and when a system builds a template from this person, the system can use that template only with this person.

Applications such as cell phones may employ both technologies. These applications may use speaker-independent technology for standard commands, such as "call," and speaker-dependent technology for learned commands, such as "home" or "office." Each user of a dependent system needs to enroll. Enrollment can be as simple as a user saying a recognizable word, saying a few words to determine which of x standard templates is best for this user, reading a script to provide enough data to build a template, or even supporting correction of inaccurately recognized words.

You are not limited to one speech template. For example, you can dynamically select a template based on the initial statement a person makes, enabling the speech-recognition engine to identify a regional accent and use a region template better suited to the speaker. A system could also create templates for each user for applications that have multiple users. Each template stores an adaptive model of the user's voice. Either the user could identify himself or herself directly, or the engine could automatically recognize the user as user 1 or user 2, with a user ID reserved for unregistered users.
Vocabulary size also determines how much memory you need. You can store vocabularies in ROM if you do not expect them to change or if you are not using a dynamic model that swaps portions of the vocabulary into and out of memory as needed. For telephony systems, the larger the vocabulary, the more MIPS you need to process speech, and the fewer the channels you can support.
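A dynamic model that keeps only the relevant portion of the vocabulary active can be pictured with the telephone-directory case from the sidebar on optimizing recognition: recognize the department first, then search only the names in that department. This is a minimal sketch; the directory contents and function names are invented for illustration and do not come from any vendor's engine.

```python
# Hypothetical directory: department -> names. Note the duplicate
# "john smith", which is why asking for the department first helps.
DIRECTORY = {
    "engineering": ["conrad lee", "john smith"],
    "sales": ["john smith", "marta ruiz"],
}


def active_vocabulary(department=None):
    """Return the words the engine must search in the current context.

    With no department chosen yet, only department names are active.
    Once a department is known, only that department's names are active.
    """
    if department is None:
        return sorted(DIRECTORY)
    return sorted(DIRECTORY[department])


def full_vocabulary():
    """Everything the engine would search without context."""
    names = {name for people in DIRECTORY.values() for name in people}
    return sorted(names | set(DIRECTORY))
```

At each step the active set is smaller than the full vocabulary, which is the point of the technique: a smaller search space for the engine and less vocabulary resident in memory at one time.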

If you need fast recognition, don't be afraid to throw memory at your problem. Memory is often more important than processor speed. Consider a case in which only half of the vocabulary (the half used most frequently) is available in internal memory. If the speaker uses words from the other half of the vocabulary, loading sections of the second half may result in a longer delay than a search of the entire vocabulary would take if it were readily available.

Figure 2

[RESPONSE]="BUY 100 SHARES OF NCC AT $4.50."

RECOGNIZER
[ACTION]=BUY      CONFIDENCE=98%
[PRICE]=$4.50     CONFIDENCE=96%
[QUANTITY]=100    CONFIDENCE=97%
[STOCK]           CONFIDENCE=63%
OVERALL CONFIDENCE=72%

[QUERY]="DO YOU WANT [misrecognized stock]?"   RESPONSE: "NO, NCC."
[QUERY]="WHICH STOCK DO YOU WANT?"             RESPONSE: "NCC."

RECOGNIZER
[STOCK]=NCC

[ACTION]=BUY      CONFIDENCE=98%
[PRICE]=$4.50     CONFIDENCE=96%
[QUANTITY]=100    CONFIDENCE=97%
[STOCK]=NCC       CONFIDENCE=97%
OVERALL CONFIDENCE: sufficient for the system to act

In single-phrase correction, a 72% confidence score for a complete response is too low to assume that the system has understood the response. The low confidence of [stock], however, brings down the overall score. Confirming or asking the stock choice again increases the confidence of [stock] and thus raises the overall confidence to a level sufficient for the system to act. Note that, generally, each semantic slot has several recognized meanings, listed by declining confidence. The grammar selects the most likely meaning based on confidence and context.

Note that building a vocabulary, even a small one, is not a trivial undertaking. Companies building speaker-independent models commonly take 500 samples for each word. You can find most common words, such as "yes" and "no," off the shelf. With numbers, consider supporting more than just zero through nine, especially if you want to say dollar values or even phone numbers (for example, I say the last four digits of my phone number as "eighty-nine-oh-six" for 8906). If you need custom words, your speech-recognition supplier can probably build models for you, but remember that such services can take several months.
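To see why a number vocabulary needs more than zero through nine, consider what it takes to turn "eighty-nine-oh-six" back into 8906. The sketch below assumes the recognizer has already produced text; the function name and word tables are illustrative only, and the vocabulary stops at ninety-nine per spoken group.

```python
ZERO_WORDS = {"zero", "oh"}
ONES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def spoken_to_digits(phrase):
    """Convert a spoken digit string such as 'eighty-nine-oh-six' to '8906'."""
    words = phrase.lower().replace("-", " ").replace(",", " ").split()
    digits = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            value = TENS[word]
            # "eighty nine" is one two-digit group; "eighty oh" is not.
            if i + 1 < len(words) and words[i + 1] in ONES:
                value += ONES[words[i + 1]]
                i += 1
            digits.append(str(value))
        elif word in TEENS:
            digits.append(str(TEENS[word]))
        elif word in ONES:
            digits.append(str(ONES[word]))
        elif word in ZERO_WORDS:
            digits.append("0")
        else:
            raise ValueError(f"word not in number vocabulary: {word!r}")
        i += 1
    return "".join(digits)
```

The same routine accepts both "sixty-four" and "six, four," so a clarification prompt can ask for either format and still arrive at the same digits.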

DIALOGUE TOOLS

Even more important than the size of your vocabulary is how you use that vocabulary. You can buy a turnkey speech-recognition system, complete with boards, running code, application-programming interfaces, drivers, and vocabularies, but you have to face what may be the most challenging phase of design: creating the interface dialogue.



When the recognition engine converts speech to text, the engine often passes on several possible text answers, each with a confidence rating. Now, natural understanding does not imply complex artificial intelligence; the engine doesn't "understand" what a speaker is saying. Rather, dialogue tools generate grammars, which attempt to validate the possible answers based on context; select the most likely response; and then map this response to a direct action, such as dialing a phone.

Dialogue tools, such as those from Unisys and many of the companies offering speech-recognition engines, effectively build state machines. From any state in the potential conversation, control can transfer to another state based on certain input values. Unfortunately, most of these tools are still text-based and do not offer a graphical-user-interface, high-level view of the constructed dialogue. Thus, visualizing the myriad paths a conversation can take can be difficult.
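The state-machine view of a dialogue can be made concrete with a small transition table. This is a hedged sketch, not the output of any actual dialogue tool: the state names, input tokens, and functions are all hypothetical, and the inputs are assumed to be already-recognized, already-normalized responses.

```python
# (current state, recognized input) -> next state. All names are illustrative.
TRANSITIONS = {
    ("idle", "call"): "ask_name",
    ("ask_name", "name"): "confirm",
    ("confirm", "yes"): "dialing",
    ("confirm", "no"): "ask_name",
}


def step(state, token):
    """Advance one turn; unrecognized input leaves the state unchanged."""
    return TRANSITIONS.get((state, token), state)


def run(tokens, state="idle"):
    """Feed a sequence of recognized inputs through the dialogue."""
    for token in tokens:
        state = step(state, token)
    return state
```

Even this four-entry table has a loop (a "no" at confirmation returns to the name prompt), which hints at why a purely text-based view of a realistic dialogue becomes hard to visualize.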

THE OTHER HALF OF THE CONVERSATION: SPEECH SYNTHESIS

Fundamental to building a useful speech-recognition system is the ability to communicate back to the user. Realistically, there is no such thing as 100%-accurate recognition. Even between two people, confusion occasionally arises in a conversation. Additionally, sometimes your tongue slips, and you say completely different words from those you intended. For times like these, you must be able to requery the user.

Certainly, you can flash a light or beep to indicate an error, but the most natural way to communicate an error in response to speech is with speech. Additionally, if speech recognition frees the hands, speech synthesis frees the eyes. A doctor, for example, may be unable to look away during a procedure. In such a situation, the doctor could ask an instrument with speech synthesis what the instrument's status is.

Two kinds of speech synthesis exist: recorded speech and real-time, text-to-speech conversion. Today, recorded speech sounds better than synthesized speech because the human whose voice is recorded knows the context of the message and can add appropriate emphasis. Synthesizers face the challenge of naturally stitching words together, although stitching recorded words can give you lower quality than synthesized stitching because synthesizers can make some contextual adjustments.

Companies, Acuvoice among them, have applied many resources to generating synthesized speech (text to speech) that sounds more like a person than a machine. Some of these companies' achievements include the addition of inflection and nuances in pronunciation, context (reading a date as a date rather than as numbers and slashes), appropriate pauses, personality in the voice, and others. Fluent Speech even offers a graphical head that mouths words the way a person would. Note that speech synthesis is not standard in most recognition packages.

Good dialogue tools help speed the development of a system and allow you to perform a dry run on a system with voice files, letting test users actually talk to the system. For custom systems, your customers can directly experience the "listen and feel." Currently, the available dialogue tools do not autocode, but the models are straightforward to modify and provide a detailed specification for the development team. (Autocoding is coming soon.)



Building dialogues is not easy if you want to work on a natural-understanding level. For example, there are easily 50 synonyms or aliases for saying "yes." Although you could force a user to use "yes," doing so then requires that a user spend time learning the correct vocabulary for your system. In applications in which a user may call in only once, training is not an option. In any case, supporting reasonable aliases makes your system easier to use and gives it a competitive edge over rigid systems that require set responses.
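Alias support amounts to mapping many surface forms onto one canonical response before the grammar acts on it. The following sketch shows the idea with a handful of aliases; the lists are illustrative and far shorter than the 50 or so a real grammar might carry, and the function name is hypothetical.

```python
# Canonical response -> accepted aliases (illustrative, not exhaustive).
ALIASES = {
    "yes": {"yes", "yeah", "yep", "sure", "ok", "okay", "correct",
            "right", "that's right", "affirmative"},
    "no": {"no", "nope", "nah", "wrong", "incorrect", "negative"},
}


def normalize(response):
    """Map a recognized response to its canonical form, or None if unknown."""
    text = response.strip().lower().rstrip(".!?")
    for canonical, aliases in ALIASES.items():
        if text in aliases:
            return canonical
    return None
```

Returning None for anything outside the alias lists gives the dialogue a clear signal to reprompt rather than guess.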

OBJECTS IN SPEECH 

Figure 3

[DAY]={1..31, 1ST..31ST, 1+0, 1+1..3+1}
[MONTH]={JANUARY..DECEMBER, JAN..DEC, 1..12, 1+0, 1+1, 1+2}
[DATE]=[MONTH]+[DAY]
       [DAY]+[MONTH]

RECOGNIZED RESPONSES:
JAN 10
10TH OF JANUARY
1/1+0

Semantic slots are objects you can use in sentences of the grammar to represent many responses without enumerating each possibility. Slots may contain other slots. For example, [DATE] uses [DAY] and [MONTH]. Note that "+" represents distinctly speaking each digit: "one, zero" for 10.

Figure 4

GRAMMAR: (RECOGNIZED RESPONSE => ACTION)
CALL [LIST] => CALL THE NUMBER ASSOCIATED WITH THE RECOGNIZED NAME
CHANGE NUMBER FOR [LIST] => EDIT THE NUMBER ASSOCIATED WITH THE RECOGNIZED NAME
WHO IS [LIST] => READ OUT THE NUMBER ASSOCIATED WITH THE RECOGNIZED NAME

[PHONE LIST 1]={OFFICE, 1-510-558-8906; CONRAD, 1-510-558-8914}
[PHONE LIST 2]={OFFICE, 1-617-767-2676; MARTA, 1-617-767-2677}

[LIST]=[PHONE LIST 1]
"CALL OFFICE" => DIAL 1-510-558-8906
"CALL CONRAD" => DIAL 1-510-558-8914
"CHANGE NUMBER FOR OFFICE" => EDIT 1-510-558-8906
"CHANGE NUMBER FOR CONRAD" => EDIT 1-510-558-8914
"WHO IS OFFICE?" => READ OUT 1-510-558-8906
"WHO IS CONRAD?" => READ OUT 1-510-558-8914

[LIST]=[PHONE LIST 2]
"CALL OFFICE" => DIAL 1-617-767-2676
"CALL MARTA" => DIAL 1-617-767-2677
"CHANGE NUMBER FOR OFFICE" => EDIT 1-617-767-2676
"CHANGE NUMBER FOR MARTA" => EDIT 1-617-767-2677
"WHO IS OFFICE?" => READ OUT 1-617-767-2676
"WHO IS MARTA?" => READ OUT 1-617-767-2677

By defining the grammar with the object [list], you can change contexts (in this case, the current user) without having to redesign a whole new grammar.

You can also parse spoken sentences into semantic slots (Figure 1). For example, a simple stock transaction requires several pieces of information: the action, the quantity, the stock, and the price. These semantic slots can occur in a variety of orders. The grammar (the states, prompts, responses, and vocabulary that the dialogue tool creates) knows that all four pieces of information are necessary for a complete command. One advantage of using semantic slots is single-phrase correction (Figure 2). For example, if the confidence of the complex phrase, "Buy 100 shares of NCC at $4.50," is low, the confidence of one of the semantic slots, such as NCC, may be low and thus drop the overall score. The grammar engine can evaluate that the other semantic slots have a high confidence and that only the [stock] slot is in question. The grammar could then ask, "Which stock?" without making the user repeat the entire command.
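Single-phrase correction can be sketched as a per-slot confidence check. The slot confidences below are the ones shown in Figure 2; everything else is an assumption made for illustration: the 0.90 threshold, the type and function names, and the placeholder value for the misrecognized stock. The article does not give the formula behind the overall score, so the sketch checks each slot against a threshold instead of computing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0

# Hypothetical cutoff below which a slot must be confirmed.
SLOT_THRESHOLD = 0.90


def slots_to_confirm(slots, threshold=SLOT_THRESHOLD):
    """Names of the slots whose confidence is too low to act on."""
    return [slot.name for slot in slots if slot.confidence < threshold]


def correct(slots, name, value, confidence):
    """Return a new slot list with one slot replaced by a re-asked answer."""
    return [Slot(name, value, confidence) if slot.name == name else slot
            for slot in slots]


# First pass: only the stock slot is in doubt ("UNKNOWN" is a placeholder).
first_pass = [
    Slot("ACTION", "BUY", 0.98),
    Slot("PRICE", "$4.50", 0.96),
    Slot("QUANTITY", "100", 0.97),
    Slot("STOCK", "UNKNOWN", 0.63),
]
```

Here `slots_to_confirm(first_pass)` singles out the stock slot, so the dialogue needs to ask only "Which stock?" After `correct(first_pass, "STOCK", "NCC", 0.97)`, no slot remains below the threshold and the system can act.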

KNOWS HIS MASTER'S VOICE

Companies such as Voice Control Systems offer voice verification with their recognition engines, so these systems both recognize and verify the speaker. Several camps argue the reliability of voice authentication. If users identify themselves over a less-than-perfect source, such as a telephone, the sample to verify is low-quality. Such verification systems are easy to fool using recorded responses. One method for stalling such attacks is to require users to answer random prompts that are difficult to anticipate and prerecord. Unless you have a powerful voice engine linked with other reliable biometrics, you might consider speaker verification about as secure as a car lock: Anyone who wants to can break in. This is not to say that verification is without value. It keeps honest people honest; locking your cell phone or digital diary with a verbal password keeps out the casual snoop.

Verification focuses on false acceptance and false rejection. If you care about security, you want to false-reject users more often than you false-accept them. However, beware of falling into the trap of designing the "perfect" system at the expense of ease of use. By designing a system to account for every sticky possibility, you may make your system difficult for the average person to use; if a system keeps rejecting a user, that user will stop using the system. In many cases, convenience is a key factor in using voice, and you may be better off relying on an alternative form of verification, such as personal-identification numbers or passwords.

Semantic slots also reduce the work you need to do to design a dialogue. Instead of having to define a sentence for the grammar for each possible response, you can use semantic slots. For example, instead of "Say 1 to call mom," "Say 2 to call home," and so on, you can use "Say 1 to call [name]." Semantic slots are objects that you can use in other sentences of the grammar and can include objects themselves (Figure 3). Slots can also ease the challenge of creating dynamic grammars based on different contexts. For example, you can personalize a grammar by using different phone lists; in either case, the form and structure of the prompts and responses stay the same, based on the [list] slot (Figure 4).
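The [list] slot of Figure 4 can be sketched as one fixed set of sentence patterns applied to whichever phone list is current. The phone numbers are those shown in Figure 4; the dictionary keys, pattern table, and function name are hypothetical, and the input is assumed to be text the recognizer has already produced.

```python
# Per-user phone lists (numbers as shown in Figure 4; keys are illustrative).
PHONE_LISTS = {
    "user1": {"OFFICE": "1-510-558-8906", "CONRAD": "1-510-558-8914"},
    "user2": {"OFFICE": "1-617-767-2676", "MARTA": "1-617-767-2677"},
}

# One grammar for every user: sentence prefix -> action.
GRAMMAR = [
    ("CALL ", "DIAL"),
    ("CHANGE NUMBER FOR ", "EDIT"),
    ("WHO IS ", "READ OUT"),
]


def interpret(utterance, phone_list):
    """Return (action, number) for a recognized sentence, or None.

    The grammar never changes; only the contents of the [list] slot
    (the phone_list argument) depend on the current user.
    """
    text = utterance.upper().rstrip("?.! ")
    for prefix, action in GRAMMAR:
        if text.startswith(prefix):
            name = text[len(prefix):]
            if name in phone_list:
                return action, phone_list[name]
    return None
```

Switching from `PHONE_LISTS["user1"]` to `PHONE_LISTS["user2"]` changes what "Call office" dials without touching the three grammar sentences.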

GETTING THE RIGHT BALANCE

The trick to building a good dialogue is finding the right balance. Allowing users to say too much without direction can scare them, but constricting options can be frustrating. Also note that, although dialogue tools seem best suited for complex applications, many of the same fundamentals apply to low-end, command-and-control applications. An engine supporting a vocabulary of 100 words probably doesn't need to recognize variations of sentences. However, the issues of aliases, context, conversational interface, and craftsmanship of the complete grammar apply just as strongly, if not more so, because you have less room to make sure you've covered all the ways people can respond. The best interfaces reduce, not increase, user errors, and the true measure of a successful implementation is achieving invisibility to the user.

Unfortunately, desktop dictation systems, such as IBM's ViaVoice and Dragon's Naturally Speaking, have created mixed consumer expectations for speech-recognition technology. (For a description of my challenges to write portions of this article with speech-recognition technology, check out the electronic version of this article at www.ednmag.com.) Dictation is the hardest of all speech-recognition applications in that, often, no context exists; sentences can be about any subject and can use any word in the vocabulary. As a consequence, inaccuracies (95% accuracy results in one error in every 20 words) are confusing, frustrating, and annoying, and they tend to reduce user expectations.

For nondictation applications, however, the future sounds exciting. Companies that used to offer only turnkey products have started to sell their low-level technology. Chip companies often offer a software version of the algorithm, and boards and dialogue tools can work with a variety of speech-recognition engines. This situation is good news; it means that speech recognition is shifting from a "complete-solution" industry to one offering best-of-breed components.